library(dplyr)
## OR
library(tidyverse)
library(GGally)
library(correlation)
05: Filter and Select
Overview
This tutorial covers two important {dplyr} functions: filter() and select(). The two are easy to confuse: filter() uses logical assertions to return a subset of rows (cases) in a dataset, while select() returns a subset of the columns (variables) in the dataset.
To remember which does which:
- filter() works on rows, which starts with “r”, so it contains the letter “r”.
- select() works on columns, which starts with “c”, so it contains the letter “c”.
Setup
Packages
We will be focusing on {dplyr} today, which contains both the filter() and select() functions. You can either load {dplyr} alone, or all of {tidyverse} - it won’t make a difference, but you only need one or the other.
We will also make use of the {GGally} package later on for some snazzy visualisations and {correlation} for…well, I’ll give you three guesses!
Data
Today we’re going to start working with a dataset that we’re going to get familiar with over the next few weeks. Courtesy of fantastic Sussex colleague Jenny Terry, this dataset contains real data about statistics and maths anxiety.
Codebook
There’s quite a bit in this dataset, so you will need to refer to the codebook below for a description of all the variables.
This study explored the difference between maths and statistics anxiety, widely assumed to be different constructs. Participants completed the Statistics Anxiety Rating Scale (STARS) and Maths Anxiety Rating Scale - Revised (R-MARS), as well as modified versions, the STARS-M and R-MARS-S. In the modified versions of the scales, references to statistics and maths were swapped; for example, the STARS item “Studying for an examination in a statistics course” became the STARS-M item “Studying for an examination in a maths course”; and the R-MARS item “Walking into a maths class” became the R-MARS-S item “Walking into a statistics class”.
Participants also completed the State-Trait Inventory for Cognitive and Somatic Anxiety (STICSA). They completed the state anxiety items twice: once before, and once after, answering a set of five MCQ questions. These MCQ questions were either about maths, or about statistics; each participant only saw one of the two MCQ conditions.
For learning purposes, I’ve randomly generated some additional variables to add to the dataset containing info on distribution channel, consent, gender, and age. Especially for the consent variable, don’t worry: all the participants in this dataset did consent to the original study. I’ve simulated and added this variable in later to practice removing participants.
| Variable | Type | Description |
|---|---|---|
| id | Categorical | Unique ID code |
| distribution | Categorical | Channel through which the study was completed, either "preview" or "anonymous" (the latter representing "real" data). Note that this variable has been randomly generated and does NOT reflect genuine responses. |
| consent | Categorical | Whether the participant read and consented to participate ("Yes") or not ("No"). Note that this variable has been randomly generated and does NOT reflect genuine responses; all participants in this dataset did originally consent to participate. |
| gender | Categorical | Gender identity, one of "female", "male", "non-binary", or "other/pnts". "pnts" is an abbreviation for "Prefer not to say". Note that this variable has been randomly generated and does NOT reflect genuine responses. |
| age | Numeric | Age in years. Note that this variable has been randomly generated and does NOT reflect genuine responses. |
| mcq | Categorical | Independent variable for MCQ question condition, whether the participant saw MCQ questions about mathematics ("maths") or statistics ("stats"). |
| stars_[sub][number] | Numeric | Item on the Statistics Anxiety Rating Scale. There are three subscales, denoted with [sub] in the name:<br>- [test]: Test anxiety<br>- [help]: Asking for Help<br>- [int]: Interpretation Anxiety.<br>[number] corresponds to the item number. Responses given on a Likert scale from 1 (no anxiety) to 5 (a great deal of anxiety), so higher scores reflect higher levels of anxiety. |
| stars_m_[sub][number] | Numeric | Item on the Statistics Anxiety Rating Scale - Maths, a modified version of the STARS with all references to statistics replaced with maths. There are three subscales, denoted with [sub] in the name:<br>- [test]: Test anxiety<br>- [help]: Asking for Help<br>- [int]: Interpretation Anxiety.<br>[number] corresponds to the item number. Responses given on a Likert scale from 1 (no anxiety) to 5 (a great deal of anxiety), so higher scores reflect higher levels of anxiety. |
| rmars_[sub][number] | Numeric | Item on the Revised Maths Anxiety Rating Scale. There are three subscales, denoted with [sub] in the name:<br>- [test]: Test anxiety<br>- [num]: Numerical Task Anxiety<br>- [course]: Course anxiety.<br>[number] corresponds to the item number. Responses given on a Likert scale from 1 (not at all) to 5 (very much), so higher scores reflect higher levels of anxiety. |
| rmars_s_[sub][number] | Numeric | Item on the Revised Maths Anxiety Rating Scale - Statistics, a modified version of the R-MARS with all references to maths replaced with statistics. There are three subscales, denoted with [sub] in the name:<br>- [test]: Test anxiety<br>- [num]: Numerical Task Anxiety<br>- [course]: Course anxiety.<br>[number] corresponds to the item number. Responses given on a Likert scale from 1 (not at all) to 5 (very much), so higher scores reflect higher levels of anxiety. |
| sticsa_trait_[number] | Numeric | Item on the State-Trait Inventory for Cognitive and Somatic Anxiety, Trait subscale. [number] corresponds to the item number. Responses given on a Likert scale from 1 (not at all) to 4 (very much so), so higher scores reflect higher levels of anxiety. |
| sticsa_[time]_state_[number] | Numeric | Item on the State-Trait Inventory for Cognitive and Somatic Anxiety, State subscale. [time] denotes one of two times of administration: before completing the MCQ task ("pre"), or after ("post"). [number] corresponds to the item number. Responses given on a Likert scale from 1 (not at all) to 4 (very much so), so higher scores reflect higher levels of anxiety. |
| mcq_stats_[number] | Categorical | Correct (1) or incorrect (0) response to MCQ questions about statistics, covering mean ([number] = 1), standard deviation (2), confidence intervals (3), beta coefficient (4), and standard error (5). |
| mcq_maths_[number] | Categorical | Correct (1) or incorrect (0) response to MCQ questions about maths, covering mean ([number] = 1), standard deviation (2), confidence intervals (3), beta coefficient (4), and standard error (5). |
Filter
The filter() function’s primary job is to easily and transparently subset the rows within a dataset - in particular, a tibble. filter() takes one or more logical assertions and returns only the rows for which the assertion is TRUE. Columns are not affected by filter(), only rows.
General Format
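As a rough sketch of the general pattern: the placeholder names dataset_name and logical_assertion appear only in the comment, and the runnable part uses a made-up tibble (toy_data, with hypothetical columns id and score) rather than the tutorial dataset.

```r
library(dplyr)

# General pattern (placeholders only, not runnable as-is):
#   dataset_name |>
#     dplyr::filter(logical_assertion)

# Hypothetical toy tibble standing in for a real dataset
toy_data <- dplyr::tibble(
  id    = c("a", "b", "c", "d"),
  score = c(2, 5, 3, 4)
)

# Keep only the rows where the assertion `score > 3` is TRUE
high_scores <- toy_data |>
  dplyr::filter(score > 3)

high_scores
```

All columns come through untouched; only the rows that fail the assertion are dropped.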
Filtering with Assertions
The logical_assertion in the general format above is just like the assertions we saw in the first tutorial. The rows where the assertion returns TRUE will be included in the output; those that return FALSE will not. Inside the filter() command, use the names of the variables in the piped-in dataset to create the logical assertions.
As a first example, let’s use some of our familiar operators from the first tutorial. To retain only people who completed the maths MCQs, we can run:
1. Take the dataset anx_data, and then
2. Filter it, keeping only the cases where the following assertion is true:
3. The value in the mcq variable is exactly and only equal to "maths".
So, the tibble we get as output contains cases that have the value "maths", and NOT "stats", nor any NAs (because NA does not equal "maths"!).
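The steps above can be sketched on a tiny stand-in dataset. The tibble toy_anx below is made up for illustration (it is not anx_data); it only borrows the mcq variable name from the codebook.

```r
library(dplyr)

# Hypothetical miniature dataset with one missing value in `mcq`
toy_anx <- dplyr::tibble(
  id  = c("p1", "p2", "p3", "p4"),
  mcq = c("maths", "stats", NA, "maths")
)

# Keep only the cases where mcq is exactly equal to "maths"
maths_only <- toy_anx |>
  dplyr::filter(mcq == "maths")

maths_only
```

Both the "stats" case and the NA case are dropped, because neither returns TRUE for the assertion.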
Remember that for exact matches like this, we must use double-equals == and not single-equals =. If you use single equals, you’re not alone - this is such a common thing that the (incredibly friendly and helpful) error message tells you what to do to fix it!
anx_data |>
  dplyr::filter(mcq = "maths")
Error in `dplyr::filter()`:
! We detected a named input.
ℹ This usually means that you've used `=` instead of `==`.
ℹ Did you mean `mcq == "maths"`?
Naturally, we can also filter on numeric values. If we wanted to keep only participants younger than 40 years old, we can filter as follows:
1. Take the dataset anx_data, and then
2. Filter it, keeping only the cases where the following assertion is true:
3. The value in the age variable is less than 40.
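Here is a minimal sketch of that numeric filter, again on an invented tibble (toy_anx) rather than the real data, so the ages are purely illustrative.

```r
library(dplyr)

# Hypothetical miniature dataset; ages are made up
toy_anx <- dplyr::tibble(
  id  = c("p1", "p2", "p3", "p4", "p5"),
  age = c(19, 40, 57, NA, 23)
)

# Keep only the cases where age is less than 40
under_40 <- toy_anx |>
  dplyr::filter(age < 40)

under_40
```

Note that the participant aged exactly 40 is not kept (40 is not less than 40), and neither is the one with a missing age.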
As a final example, let’s consider a situation where we want to retain only participants that gave a gender identity of either “male” or “female”.[2]
To do this, we need a new operator: %in%, which God knows I just pronounce as “in” (try saying “percent-in-percent” three times fast!). This looks for any matches with any of the elements that come after it:
1. Take the dataset anx_data, and then
2. Filter it, keeping only the cases where the following assertion is true:
3. The value in the gender variable matches any of the values “female” or “male”.
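A small sketch of the %in% filter described above, using a made-up tibble (toy_anx) with the same gender categories as the codebook:

```r
library(dplyr)

# Hypothetical miniature dataset covering every gender category plus an NA
toy_anx <- dplyr::tibble(
  id     = c("p1", "p2", "p3", "p4", "p5", "p6"),
  gender = c("female", "male", "non-binary", "other/pnts", "male", NA)
)

# Keep the cases whose gender matches ANY of the listed values
female_or_male <- toy_anx |>
  dplyr::filter(gender %in% c("female", "male"))

female_or_male
```

The order of the values after %in% makes no difference to which cases are kept.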
==?
What follows here is a rabbit hole that gets into some gritty detail. If you’re happy to take my word for it that you absolutely, definitely needed %in% and not == in the previous exercise, you can skip the explanation below. If you’re keen to understand all the nuance, read on!
== vs %in%
For this matching task, you might have thought we’d use gender == c("female", "male"), which runs successfully and sure looks okay. So why isn’t this right?
## DO NOT DO THIS
anx_data |>
  ## THIS DOES NOT DO WHAT WE WANT!!
  dplyr::filter(gender == c("female", "male"))
## DANGER WILL ROBINSON
At a glance it looks like this produces the same output as the solution above - gender now contains only male or female participants. As you might have gathered from the all-caps comments above - intended to prevent you from accidentally using this code in the future for tasks like this - this is NOT what this code does.
To demonstrate what it does do, I need the dplyr::mutate() function from the next tutorial to create some new variables. The first new variable, double_equals, contains TRUEs and FALSEs for each case using the assertion with ==. The second is exactly the same, but reverses the order of the genders - something that should NOT make a difference to the matching! (We want either female OR male participants, regardless of which we happen to write first.) The third, in_op, contains the same again but this time with %in%. The final arrange() line sorts the dataset by gender to make the output easier to read.
anx_data |>
  dplyr::mutate(
    double_equals = (gender == c("female", "male")),
    double_equals_rev = (gender == c("male", "female")),
    in_op = (gender %in% c("female", "male")),
    .keep = "used"
  ) |>
  dplyr::arrange(gender)
Notice anything wild?
For participants with the same value in gender, the assertions with == both flip between TRUE and FALSE, but in the reverse pattern to each other. The assertion with %in% correctly labels them all as TRUE. WTF?
What’s happening is that because the vector c("female", "male") contains two elements, the assertion with == matches the first case to the first element - female - and returns TRUE. Then it matches the second case to the second element - male - and this time returns FALSE. Then because there are more cases, it repeats: the next (third) case matches female and returns TRUE, the next male and FALSE, and so forth. The == assertion with the gender categories reversed does the same, but starts with male first and female second. Only %in% actually does what we wanted, which was to return TRUE for any case that matches female OR male.
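The same behaviour can be reproduced without any dataset at all. This is a minimal sketch on a short made-up vector (gender), showing how == pairs elements up position by position and recycles the shorter vector, while %in% checks each element against the whole set.

```r
# Hypothetical vector of four responses
gender <- c("female", "female", "male", "male")

# == recycles c("female", "male") along gender:
# position 1 vs "female", position 2 vs "male",
# position 3 vs "female", position 4 vs "male"
double_equals <- gender == c("female", "male")

# Reversing the order flips the pattern
double_equals_rev <- gender == c("male", "female")

# %in% asks, for each element, "is this one of the listed values?"
in_op <- gender %in% c("female", "male")

double_equals      # TRUE FALSE FALSE TRUE
double_equals_rev  # FALSE TRUE TRUE FALSE
in_op              # TRUE TRUE TRUE TRUE
```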
This is a good example of what I think of as “dangerous” code. I don’t mean “reckless” or “irresponsible” - R is just doing exactly what I asked it to do, and it’s not the job of the language or package creators to make sure my code is right. I mean dangerous because it runs as expected, produces (what looks like) the right output, and even with some brief checking, would appear to contain the right cases - but would quietly result in a large chunk of the data being wrongly discarded. If you didn’t know about %in%, or how to carefully double-check your work, you could easily carry on from here and think no more about it.
So, how can we avoid a problem like this? Think of any coding task - especially new ones, where you’re not completely familiar with the code or functions you’re working with - as a three-step process.[3]
- Anticipate. Form a clear picture of the task you are trying to achieve with your code. What do you expect the output of the code to look like when it runs successfully?
- Execute. Develop and run the code to perform the task.
- Confirm. Compare the output to your expectations, and perform tests to confirm that what you think the code has done, is in fact what it has done.
So, what might the Confirm step look like for a situation like this?
One option is the code I created above, with new columns for the different assertion options - but this might be something you’d only think to do if you already knew about %in% or suspected there was a problem. A more routine check might look like:
I expect that when my filtering is accomplished, my dataset will contain all and only the participants who reported a gender identity of female or male, and no others. I will also have the same number of cases as the original dataset, less the number of other gender categories.
First, I’ll create a new dataset using the filtered data.
## SERIOUSLY THIS IS BAD
anx_data_bd <- anx_data |>
  ## DON'T USE THIS CODE FOR MATCHING
  dplyr::filter(gender == c("female", "male"))
## STOP OH GOD PLEASE JUST DON'T
Check 1: Filtered data contains only male and female participants.
anx_data_bd |>
  dplyr::count(gender)
Only female and male participants! Tick ✅
At this point, though, I might become suspicious. The original dataset contained 465 cases - we’ve lost more than half! Can that be right?? Better check the numbers.
## Get the numbers from the original dataset
anx_data |>
  dplyr::count(gender)
Uh oh. Already we can see that something’s wrong with the numbers. But instead of relying on visual checks, let’s let R tell us.
## Calculate how many cases we expect if the filtering had gone right
expected_n <- anx_data |>
  dplyr::count(gender) |>
  ## This isn't the best way to filter
  dplyr::filter(gender != "non-binary") |>
  ## The next section on multiple assertions has a much better method!
  dplyr::filter(gender != "other/pnts") |>
  dplyr::pull(n) |>
  sum()
## Ask R whether the expected number of rows is equal to the actual number of rows in the filtered data
expected_n == nrow(anx_data_bd)
[1] FALSE
Now we know for sure there’s a problem and can investigate what happened more thoroughly.
As a final stop on this incredibly lengthy detour (are you still here? 👋), you might wonder whether the check above would give me the wrong answer, because I used two filter()s in a row, and the whole point of this goose chase is how to accomplish that exact filtering task. First, this is NOT the way I would do this (as the comments suggest), but I’m really trying to stick to ONLY what we’ve already covered wherever possible. But let’s say I’d tried to do this with the bad == filtering that caused all this faff in the first place.
For this particular case there are four values in gender. If I try gender == c("female", "male") here, this DOES actually work fine - because the categories happen to be in the right order and the number of rows (four) is a multiple of the length of the comparison vector (two) 🤦 But at least the numbers still wouldn’t match, which would tell me that something went wrong with filtering the whole dataset.
anx_data |>
  dplyr::count(gender) |>
  dplyr::filter(gender == c("female", "male"))
If I happened to have had the genders the other way round, I would have got an empty tibble, and hopefully that also would have clued me in that there was a problem with the original filtering.
anx_data |>
  dplyr::count(gender) |>
  dplyr::filter(gender == c("male", "female"))
Multiple Assertions
Logical assertions can also be combined to specify exactly the cases you want to retain. The two most important operators are:
- & (AND): Only cases that return TRUE for all assertions will be retained.
- | (OR): Any cases that return TRUE for at least one assertion will be retained.
Let’s look at a couple minimal examples to get the hang of these two symbols. For each of these, you can think of the single response R gives as the answer to the questions, “Are ALL of these assertions true?” for AND, and “Is AT LEAST ONE of these assertions true?” for OR.
First, let’s start with a few straightforward logical assertions:
"apple" == "apple"
[1] TRUE
23 > 12
[1] TRUE
42 == "the answer"
[1] FALSE
10 > 50
[1] FALSE
Next, let’s look at how they combine.
Two true statements, combined with &, return TRUE, because it is true that all of these assertions are true.
"apple" == "apple" & 23 > 12
[1] TRUE
Two true statements, combined with |, also return TRUE, because it is true that at least one of these assertions is true.
"apple" == "apple" | 23 > 12
[1] TRUE
Two false statements, combined with &, return FALSE, because it is NOT true that all of them are true.
42 == "the answer" & 10 > 50
[1] FALSE
Two false statements, combined with |, return FALSE, because it is NOT true that at least one of them is true.
42 == "the answer" | 10 > 50
[1] FALSE
One true and one false statement, combined with &, return FALSE, because it is NOT true that all of them are true.
23 > 12 & 42 == "the answer"
[1] FALSE
One true and one false statement, combined with |, return TRUE, because it is true that at least one of them is true.
23 > 12 | 42 == "the answer"
[1] TRUE
To see how this works, let’s filter anx_data to keep only cases that saw the stats MCQs, OR that scored 3 or higher on the first STARS test subscale item.
This requires two separate statements, combined with | “OR”:
1. Take the dataset anx_data, and then
2. Filter it, keeping only the cases where the following assertion is true:
3. The value in the mcq variable is only and exactly equal to "stats", OR
4. The value in stars_test1 is greater than or equal to 3.
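The steps above can be sketched as follows. The tibble toy_anx is invented for illustration and just reuses the variable names mcq and stars_test1 from the codebook.

```r
library(dplyr)

# Hypothetical miniature dataset
toy_anx <- dplyr::tibble(
  id          = c("p1", "p2", "p3", "p4"),
  mcq         = c("stats", "maths", "maths", "stats"),
  stars_test1 = c(1, 4, 2, 5)
)

# Keep cases that saw the stats MCQs OR scored 3 or higher on stars_test1
stats_or_anxious <- toy_anx |>
  dplyr::filter(mcq == "stats" | stars_test1 >= 3)

# For comparison: swapping | for & keeps only cases that meet BOTH conditions
stats_and_anxious <- toy_anx |>
  dplyr::filter(mcq == "stats" & stars_test1 >= 3)

stats_or_anxious
stats_and_anxious
```

Only p3 is dropped by the OR version, since it is the only case that fails both assertions.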
Data Cleaning
Filtering is absolutely invaluable in the process of data cleaning. In order to practice this process, I’ve introduced some messy values into the data, so let’s have a look at a method of cleaning up the dataset and documenting our changes as we go.
Pre-Exclusions
For data collected on platforms like Qualtrics, you can frequently test out your study via a preview mode. Responses completed via preview are still recorded in Qualtrics, but labeled as such in a variable typically called “DistributionChannel” or similar. In this dataset, we have a similar variable, distribution, that labels whether the data was recorded in a preview ("preview") or from real participants ("anonymous").
Your method may vary, but I wouldn’t bother to document these cases as “exclusions” because they aren’t real data. I would just drop them from the dataset - but of course make sure to record the code that does so!
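One way this might look, sketched on a made-up tibble (toy_anx) with a distribution variable coded as in the codebook. This is an assumption about how you could do it, not a record of how the tutorial data were processed.

```r
library(dplyr)

# Hypothetical miniature dataset with one preview response
toy_anx <- dplyr::tibble(
  id           = c("p1", "p2", "p3"),
  distribution = c("anonymous", "preview", "anonymous")
)

# Keep only the real ("anonymous") responses, dropping previews
toy_anx_real <- toy_anx |>
  dplyr::filter(distribution == "anonymous")

toy_anx_real
```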
Recording Exclusions
As a part of complete and transparent reporting, we will want to report all of the reasons we excluded cases from our dataset, along with the number excluded. We can build this counting process into our workflow so that at the end, we have a record of each exclusion along with initial and final numbers.
For each check below, our recording process will have two steps:
- Produce a dataset of the cases you will exclude, and count the number of rows (cases).
- Remove the cases and overwrite the old dataset with the new one.
In my process, I’m going to keep anx_data as the original, “raw” version of the dataset. So, I’ll create a copy in a new dataset object to use while “processing” that I will update as I go.
anx_data_proc <- anx_data
To begin, we will count the initial number of cases before any exclusions.
n_initial <- nrow(anx_data_proc)
n_initial
[1] 453
(Remember that we can use nrow() because there is only one participant per row. If we had long-form data with observations from the same participant across multiple rows, we would have to do something a bit different!)
Consent
For many datasets, you would likely have a variable with responses from your participants about informed consent. How you filter this depends on what that variable contains, of course. However, we’ve already seen examples of this kind of operation earlier in this tutorial.
For the first assertion, we capture any responses that don’t match “Yes”, but for the second, we need to use a function from a family we met all the way back in Tutorial 01/02, namely is.na().
You can think of is.na() as a question about whatever is in its brackets: “Is (this) NA?” If the value IS an NA, R will return TRUE; if it’s anything else at all, R will return FALSE. So, to get an accurate count, we need to capture people who either answered something other than “Yes”, or didn’t answer at all.
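To see why both parts are needed, here is a small sketch on a made-up vector of consent responses (not the real data). Comparing an NA with anything returns NA rather than TRUE or FALSE, and filter() only keeps rows where the assertion is TRUE.

```r
library(dplyr)

# Hypothetical consent responses, including one non-response
consent <- c("Yes", "No", NA, "Yes")

# On its own, != leaves the NA as NA
not_yes <- consent != "Yes"

# Adding is.na() turns the NA case into TRUE
not_yes_or_missing <- consent != "Yes" | is.na(consent)

not_yes             # FALSE TRUE NA FALSE
not_yes_or_missing  # FALSE TRUE TRUE FALSE

# The same two assertions inside filter(), counting matching rows
toy_consent <- dplyr::tibble(consent = consent)

n_without_na_check <- toy_consent |>
  dplyr::filter(consent != "Yes") |>
  nrow()

n_with_na_check <- toy_consent |>
  dplyr::filter(consent != "Yes" | is.na(consent)) |>
  nrow()
```

Without the is.na() part, the non-responder would be missing from the count of people who didn't consent.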
n_no_consent <- anx_data_proc |>
  dplyr::filter(consent != "Yes" | is.na(consent)) |>
  nrow()
n_no_consent
[1] 33
Then, we remove all participants who did not actively consent and assign the resulting dataset to the same name, overwriting the previous version. As we saw before, the below would discard cases that answered “No” (along with any other value not exactly matching “Yes”) and cases with NAs from people who didn’t answer.
anx_data_proc <- anx_data_proc |>
  dplyr::filter(consent == "Yes")
Age
For low-risk ethics applications, you may want to exclude people who reported an age below the age of informed consent (typically 18). This may look like age >= 18 or similar in your dataset. However, it’s also important to check for errors or improbable ages, or to remove any participants that are too old if your study has an upper age limit. In this case, my hypothetical study didn’t have an upper age limit, but I’ll designate any ages of 100 or above as unlikely to be genuine responses.
Since these are removed for two different reasons, I’ll save them as two separate objects.
## Store the number to be removed
n_too_young <- anx_data_proc |>
  dplyr::filter(age < 18) |>
  nrow()
n_too_young
[1] 22
n_too_old <- anx_data_proc |>
  dplyr::filter(age >= 100) |>
  nrow()
n_too_old
[1] 5
## Remove them
anx_data_proc <- anx_data_proc |>
  dplyr::filter(
    dplyr::between(age, 18, 99)
  )
Missing Values
Finally (for now), just about any study will have to decide how to deal with missing values. The possibilities for your own work are too complex for me to have a guess at here, so for now we’ll only look at how to identify and remove missing values.
Single Variable
Let’s look at a single variable to begin with - for example, sticsa_trait_3. We can confirm that this variable has a/some NAs to consider by counting the unique values:
anx_data |>
  dplyr::count(sticsa_trait_3)
The first thing you might think to try is to filter on sticsa_trait_3 == NA, but weirdly enough this doesn’t work. Instead, we again need the increasingly versatile is.na(), which, again, we can think of as a question about whatever is in its brackets: “Is (this) NA?” Let’s see this in action:
1. Take the dataset anx_data_proc, and then
2. Filter it, keeping only the cases where the following assertion is true:
3. The value in the sticsa_trait_3 variable IS missing (is NA).
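The steps above can be sketched on a made-up tibble (toy_anx, not the real data), alongside the == NA version for comparison. Any comparison with NA returns NA rather than TRUE, so filter() has no rows to keep.

```r
library(dplyr)

# Hypothetical miniature dataset with one missing response
toy_anx <- dplyr::tibble(
  id             = c("p1", "p2", "p3"),
  sticsa_trait_3 = c(2, NA, 4)
)

# Comparing with == NA gives NA for every row, so nothing is returned
with_double_equals <- toy_anx |>
  dplyr::filter(sticsa_trait_3 == NA)

# is.na() returns TRUE only for the missing value
with_is_na <- toy_anx |>
  dplyr::filter(is.na(sticsa_trait_3))

nrow(with_double_equals)
nrow(with_is_na)
```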
These are the cases we want to remove, so we count how many there are and assign that number to a useful object name, as we did before.
n_sticsa_t3_missing <- anx_data_proc |>
  dplyr::filter(
    is.na(sticsa_trait_3)
  ) |>
  nrow()
n_sticsa_t3_missing
[1] 3
Next, we need to actually exclude these cases. This time, we want to retain the inverse of the previous filtering requirement: that is, we only want to keep the cases that are NOT missing a value in sticsa_trait_3, the opposite of what we got from is.na(sticsa_trait_3). You may recognise “the inverse” or “not-x” as something we’ve seen before with !=, “not-equals”. For anything that returns TRUE and FALSE, you can get the inverse by putting an ! before it. (Try running !TRUE, for example!)
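A quick sketch of how ! flips logical values, using a short made-up vector (responses) rather than the real variable:

```r
# Negating a single value
not_true <- !TRUE

# Hypothetical responses with one missing value
responses <- c(3, NA, 1)

# is.na() flags the missing value...
is_missing <- is.na(responses)

# ...and ! flips every TRUE to FALSE and vice versa
is_present <- !is.na(responses)

not_true    # FALSE
is_missing  # FALSE TRUE FALSE
is_present  # TRUE FALSE TRUE
```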
So, to create my clean anx_data_final dataset, I can use the assertion !is.na(sticsa_trait_3) to keep only the participants who answered this question - who do NOT have a missing value.
Finally, I can store the actual number of usable cases, according to my cleaning requirements, in a final object to use when reporting.
anx_data_final <- anx_data_proc |>
  dplyr::filter(
    !is.na(sticsa_trait_3)
  )
n_final <- nrow(anx_data_final)
n_final
[1] 390
All Variables
Removing NAs is a tricky process, but if you’re sure that you want to drop all cases with missing values in your dataset, there are a few helper functions to make this easy.
For this, we’re going to leave filter() for a moment and look at a different function, tidyr::drop_na(). This function takes a tibble as input, and returns the same tibble as output, but with any rows that had missing values removed.
This is a pretty major step and should be used with caution! If we didn’t check our data carefully, we could easily end up dropping a bunch of cases we didn’t want to get rid of.
For example, if we apply it incautiously here:
anx_data_proc |>
  tidyr::drop_na()
Well, there goes all our data!
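To see how this can happen, here is a minimal sketch on an invented tibble (toy_anx, with hypothetical columns item_a and item_b) where every row is missing something somewhere. It also shows that drop_na() accepts column names, which restricts the check to just those columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical miniature dataset: each row has an NA in at least one column
toy_anx <- dplyr::tibble(
  id     = c("p1", "p2", "p3"),
  item_a = c(1, NA, 3),
  item_b = c(NA, 2, NA)
)

# With no columns named, any NA anywhere in a row drops that row
dropped_everything <- toy_anx |>
  tidyr::drop_na()

# Naming a column only drops rows with an NA in that column
dropped_item_a_only <- toy_anx |>
  tidyr::drop_na(item_a)

nrow(dropped_everything)
nrow(dropped_item_a_only)
```

Counting the NAs per variable first is one way to check what drop_na() is about to remove before committing to it.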
Reporting
Select
The select() function is probably the most straightforward of the core {dplyr} functions. Its primary job is to easily and transparently subset the columns within a dataset - in particular, a tibble. Rows are not affected by select(), only columns.
General Format
To subset a tibble, use the general format:
dataset_name |>                        # 1
  dplyr::select(                       # 2
    variable_to_include,               # 3
    -variable_to_exclude,              # 4
    keep_this_one:through_this_one,    # 5
    new_name = variable_to_rename,     # 6
    variable_number                    # 7
  )
1. Take the dataset dataset_name, and then
2. Select the following variables:
3. The name of a variable to be included in the output. Multiple variables can be selected separated by commas.
4. The name of a variable to be excluded from the output. Use either an exclamation mark (!) or a minus sign (-) in front of each variable to exclude. Multiple variables can be dropped, separated by commas with a ! (or -) before each.
5. A range of variables to include in the output. All the variables between and including the two named will be selected (or dropped, with !(drop_this_one:through_this_one)).
6. Include variable_to_rename in the output, but call it new_name.
7. Include a variable in the output by where it appears in the dataset, numbered left to right. For example, “2” will select the second column in the original dataset.
Columns will appear in the output in the order they are selected in select(), so this function can also be used to reorder columns.
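Each of the options above can be tried out on a small stand-in dataset. This sketch uses an invented tibble (toy_anx) that borrows a handful of variable names from the codebook; the new name participant is made up for the renaming example.

```r
library(dplyr)

# Hypothetical miniature dataset
toy_anx <- dplyr::tibble(
  id          = c("p1", "p2"),
  age         = c(19, 34),
  gender      = c("female", "male"),
  mcq         = c("maths", "stats"),
  stars_test1 = c(2, 4)
)

# Include variables by name
by_name <- toy_anx |> dplyr::select(id, mcq)

# Exclude a variable
without_age <- toy_anx |> dplyr::select(-age)

# Include a range of adjacent variables
age_to_mcq <- toy_anx |> dplyr::select(age:mcq)

# Rename while selecting
renamed <- toy_anx |> dplyr::select(participant = id, mcq)

# Select by position (the second column)
second_col <- toy_anx |> dplyr::select(2)

# Reorder: columns come out in the order they are listed
reordered <- toy_anx |> dplyr::select(mcq, id)

names(by_name)
names(reordered)
```

In every case the number of rows stays the same; only the columns change.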
Selecting Directly
The best way to get the hang of this will be to give it a go, so let’s dive on in!
That’s really all there is to it!
…Or is it?[5]
Using {tidyselect}
The real power in select(), and in many other {tidyverse} functions, is in a system of helper functions and notations collectively called <tidyselect>. The overall goal of “<tidyselect> semantics” (as you will see it referred to in help documentation) is to make selecting variables easy, efficient, and clear.
At UG level at Sussex, students are not taught about <tidyselect> in core modules. However, <tidyselect> is desperately useful and makes complex data wrangling/cleaning a lot faster and more efficient, especially (for instance) for questionnaires with similarly-named subscales, so would make for a great collaborative activity with supervisors!
These helper functions can be combined with the selection methods above in any combination. Some very convenient options include:
- everything() for all columns
- starts_with(), ends_with(), and contains() for selecting columns by shared name elements
- where() for selecting with a function, described in the next section
Rather than list examples of all the helper functions here, it’s best to just try them out for yourself!
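As a starting point for experimenting, here is a short sketch of three of these helpers on an invented tibble (toy_anx) whose column names follow the naming pattern in the codebook.

```r
library(dplyr)

# Hypothetical miniature dataset with questionnaire-style column names
toy_anx <- dplyr::tibble(
  id             = c("p1", "p2"),
  stars_test1    = c(2, 4),
  stars_help1    = c(1, 3),
  rmars_test1    = c(5, 2),
  sticsa_trait_1 = c(3, 1)
)

# All columns whose names begin with "stars"
stars_items <- toy_anx |>
  dplyr::select(starts_with("stars"))

# All columns whose names contain "test" anywhere
test_items <- toy_anx |>
  dplyr::select(contains("test"))

# Move one column to the front, then keep everything else in its original order
rmars_first <- toy_anx |>
  dplyr::select(rmars_test1, everything())

names(stars_items)
names(test_items)
names(rmars_first)
```

With a real questionnaire dataset, one starts_with() call can stand in for typing out dozens of item names.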
Using Functions
Let’s say we want to generate a summary table of the variables in our dataset. Before we can create our summary in the next tutorial, we may first want to produce a subset of our dataset that only contains numeric variables.
To do this, we can use the <tidyselect> helper function where(). This helper function lets us use any function that returns TRUE and FALSE to select columns. Essentially, we don’t have to select columns using name or position - we can use any criteria we want, as long as we have (or can create…!) a function that expresses those criteria.
Especially helpful here is the is.*() family of functions in base R. This group of functions all have the same format, where the * is a stand-in for any type of data or object, e.g. is.logical(), is.numeric(), is.factor() etc. (The very useful is.na() that we’ve seen with filter() above is also a member of this family.) These functions work like a question about whatever you put into them - for example, is.numeric() can be read as, “Is (whatever’s in the brackets) numeric data?”
You can quickly find all of the functions in this family by typing is. in a code chunk and pressing Tab.
Putting these two together, we could accomplish the task of selecting only numeric variables as follows:
anx_data |>
  dplyr::select(
    where(is.numeric)
  )
This command evaluates each column and determines whether it contains numeric data (TRUE) or not (FALSE), and only returns the columns that return TRUE.
Using Custom Functions
The following material in this section isn’t covered in the live workshops. It’s included here for reference because it’s extremely useful in real R analysis workflows, but it won’t be essential for any of the workshop tasks.
The function in where() that determines which columns to keep doesn’t have to be an existing named function. Another option is to use a “purrr-style lambda” or formula (a phrase you may see in help documentation) to write our own criteria on the spot.
For example, let’s select all of the numeric variables that had a mean of 3 or higher:
anx_data |>
  dplyr::select(
    where(~is.numeric(.x) & mean(.x, na.rm = TRUE) >= 3)
  )
Instead of just the name of a function, as we had before, we now have a formula. This formula has a few key characteristics:
- The ~ (apparently pronounced “twiddle”!) at the beginning, which is a shortcut for the longer function(x) ... notation for creating functions.
- The .x, which is a placeholder for each of the variables that the function will be applied to.
So, this command can be read: “Take my tibble and select all the columns where the following is true: the data type is numeric AND the mean value in that column is greater than or equal to 3 (ignoring missing values).”
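The same idea can be tried on a tiny invented tibble (toy_anx, with made-up column names). This sketch writes the criteria twice, once as a formula and once with the longer function(x) notation, to show that they select the same columns. It uses && rather than & as a small variation: && stops as soon as is.numeric() is FALSE, so mean() is never attempted on the text columns.

```r
library(dplyr)

# Hypothetical miniature dataset with text and numeric columns
toy_anx <- dplyr::tibble(
  id        = c("p1", "p2", "p3"),
  low_item  = c(1, 2, NA),
  high_item = c(4, 5, 3),
  mcq       = c("maths", "stats", "maths")
)

# All numeric columns
numeric_cols <- toy_anx |>
  dplyr::select(where(is.numeric))

# Numeric columns with a mean of 3 or higher, written as a formula
high_mean_formula <- toy_anx |>
  dplyr::select(where(~ is.numeric(.x) && mean(.x, na.rm = TRUE) >= 3))

# The same criteria written out with function(x)
high_mean_function <- toy_anx |>
  dplyr::select(where(function(x) is.numeric(x) && mean(x, na.rm = TRUE) >= 3))

names(numeric_cols)
names(high_mean_formula)
```

Here low_item has a mean of 1.5 once the NA is ignored, so only high_item (mean 4) passes both criteria.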
Quick Test: t-test and \(\chi^2\)
This portion of the tutorial is still under construction. Check back another time!
Footnotes
1. This incredibly useful property is called “data masking”. If you want to know more, run vignette("programming") in the Console.
2. I’m not wild about this example - the experiences of non-binary and other genders are just as important! Unfortunately it’s the only variable in the dataset with the right number of categories.
3. I did try to think of a snazzy acronym here, but all I came up with is AEC (yikes). I’ll keep thinking and try to update this with something better, and I welcome suggestions if you’ve made it this far!
4. I’m sure Jenny would tell you there’s a little more to it than that, especially with 12,570 students from 100 universities in 35 countries, collected in 21 languages! But that’s both the dream and the general idea.
5. Have you seen the size of this tutorial?? Of course it isn’t!